{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pandas数据分组聚合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据聚合是数据处理的最后一步，通常是要使每一个数组生成一个单一的数值。\n",
    "\n",
    "数据分类处理：\n",
    "\n",
    " - 分组：先把数据分为几组\n",
    " - 用函数处理：为不同组的数据应用不同的函数以转换数据\n",
    " - 合并：把不同组得到的结果合并起来\n",
    " \n",
    "数据分类处理的核心：\n",
    "     groupby()函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>green</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>green</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>yellow</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>blue</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>blue</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>yellow</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>yellow</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    color  price\n",
       "0   green      4\n",
       "1   green      5\n",
       "2  yellow      3\n",
       "3    blue      2\n",
       "4    blue      1\n",
       "5  yellow      7\n",
       "6  yellow      6"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 创建DataFrame\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        'color': ['green', 'green', 'yellow', 'blue', 'blue', 'yellow', 'yellow'],\n",
    "        'price': [4, 5, 3, 2, 1, 7, 6]\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000024905F4F910>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 按 color 来进行分组\n",
    "df.groupby(by='color')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "使用.groups属性查看各行的分组情况："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'blue': [3, 4], 'green': [0, 1], 'yellow': [2, 5, 6]}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby(by='color').groups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>color</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>blue</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>green</th>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>yellow</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        price\n",
       "color        \n",
       "blue        3\n",
       "green       9\n",
       "yellow     16"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 分组 + 聚合\n",
    "df.groupby('color').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "============================================\n",
    "\n",
    "练习：\n",
    "\n",
    "   假设菜市场张大妈在卖菜，有以下属性：\n",
    "   \n",
    "   菜品(item)：萝卜，白菜，辣椒，冬瓜\n",
    "   \n",
    "   颜色(color)：白，青，红\n",
    "   \n",
    "   重量(weight)\n",
    "   \n",
    "   价格(price)\n",
    "   \n",
    "1. 要求以属性作为列索引，新建一个ddd\n",
    "2. 对ddd进行聚合操作，求出颜色为白色的价格总和\n",
    "3. 对ddd进行聚合操作，分别求出萝卜的所有重量以及平均价格\n",
    "4. 使用merge合并总重量及平均价格\n",
    "\n",
    "```Python\n",
    "ddd = DataFrame(\n",
    "    data={\n",
    "        \"item\": [\"萝卜\",\"白菜\",\"辣椒\",\"冬瓜\",\"萝卜\",\"白菜\",\"辣椒\",\"冬瓜\"],\n",
    "        'color':[\"白\",\"青\",\"红\",\"白\",\"青\",\"红\",\"白\",\"青\"],\n",
    "        'weight': [10,20,10,10,30,40,50,60],\n",
    "        'price': [0.99, 1.99, 2.99, 3.99, 4, 5, 6,7]\n",
    "    }\n",
    ")\n",
    "```\n",
    "============================================"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>item</th>\n",
       "      <th>color</th>\n",
       "      <th>weight</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>萝卜</td>\n",
       "      <td>白</td>\n",
       "      <td>10</td>\n",
       "      <td>0.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>白菜</td>\n",
       "      <td>青</td>\n",
       "      <td>20</td>\n",
       "      <td>1.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>辣椒</td>\n",
       "      <td>红</td>\n",
       "      <td>10</td>\n",
       "      <td>2.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>冬瓜</td>\n",
       "      <td>白</td>\n",
       "      <td>10</td>\n",
       "      <td>3.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>萝卜</td>\n",
       "      <td>青</td>\n",
       "      <td>30</td>\n",
       "      <td>4.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>白菜</td>\n",
       "      <td>红</td>\n",
       "      <td>40</td>\n",
       "      <td>5.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>辣椒</td>\n",
       "      <td>白</td>\n",
       "      <td>50</td>\n",
       "      <td>6.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>冬瓜</td>\n",
       "      <td>青</td>\n",
       "      <td>60</td>\n",
       "      <td>7.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  item color  weight  price\n",
       "0   萝卜     白      10   0.99\n",
       "1   白菜     青      20   1.99\n",
       "2   辣椒     红      10   2.99\n",
       "3   冬瓜     白      10   3.99\n",
       "4   萝卜     青      30   4.00\n",
       "5   白菜     红      40   5.00\n",
       "6   辣椒     白      50   6.00\n",
       "7   冬瓜     青      60   7.00"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddd = pd.DataFrame(\n",
    "    data={\n",
    "        \"item\": [\"萝卜\",\"白菜\",\"辣椒\",\"冬瓜\",\"萝卜\",\"白菜\",\"辣椒\",\"冬瓜\"],\n",
    "        'color':[\"白\",\"青\",\"红\",\"白\",\"青\",\"红\",\"白\",\"青\"],\n",
    "        'weight': [10,20,10,10,30,40,50,60],\n",
    "        'price': [0.99, 1.99, 2.99, 3.99, 4, 5, 6,7]\n",
    "    }\n",
    ")\n",
    "ddd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>color</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>白</th>\n",
       "      <td>10.98</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       price\n",
       "color       \n",
       "白      10.98"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 对ddd进行聚合操作，求出颜色为白色的价格总和\n",
    "ddd.groupby('color')['price'].sum()  # Series\n",
    "ddd.groupby('color')[['price']].sum()  # DataFrame\n",
    "\n",
    "ddd.groupby('color')[['price']].sum().loc[['白']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>item</th>\n",
       "      <th>color</th>\n",
       "      <th>weight</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>萝卜</td>\n",
       "      <td>白</td>\n",
       "      <td>10</td>\n",
       "      <td>0.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>白菜</td>\n",
       "      <td>青</td>\n",
       "      <td>20</td>\n",
       "      <td>1.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>辣椒</td>\n",
       "      <td>红</td>\n",
       "      <td>10</td>\n",
       "      <td>2.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>冬瓜</td>\n",
       "      <td>白</td>\n",
       "      <td>10</td>\n",
       "      <td>3.99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>萝卜</td>\n",
       "      <td>青</td>\n",
       "      <td>30</td>\n",
       "      <td>4.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>白菜</td>\n",
       "      <td>红</td>\n",
       "      <td>40</td>\n",
       "      <td>5.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>辣椒</td>\n",
       "      <td>白</td>\n",
       "      <td>50</td>\n",
       "      <td>6.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>冬瓜</td>\n",
       "      <td>青</td>\n",
       "      <td>60</td>\n",
       "      <td>7.00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  item color  weight  price\n",
       "0   萝卜     白      10   0.99\n",
       "1   白菜     青      20   1.99\n",
       "2   辣椒     红      10   2.99\n",
       "3   冬瓜     白      10   3.99\n",
       "4   萝卜     青      30   4.00\n",
       "5   白菜     红      40   5.00\n",
       "6   辣椒     白      50   6.00\n",
       "7   冬瓜     青      60   7.00"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 对ddd进行聚合操作，分别求出萝卜的所有重量以及平均价格\n",
    "df1 = ddd.groupby('item')[['weight']].sum()\n",
    "\n",
    "df2 = ddd.groupby('item')[['price']].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>item</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>冬瓜</th>\n",
       "      <td>70</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>白菜</th>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>萝卜</th>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>辣椒</th>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      weight\n",
       "item        \n",
       "冬瓜        70\n",
       "白菜        60\n",
       "萝卜        40\n",
       "辣椒        60"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>item</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>冬瓜</th>\n",
       "      <td>5.495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>白菜</th>\n",
       "      <td>3.495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>萝卜</th>\n",
       "      <td>2.495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>辣椒</th>\n",
       "      <td>4.495</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      price\n",
       "item       \n",
       "冬瓜    5.495\n",
       "白菜    3.495\n",
       "萝卜    2.495\n",
       "辣椒    4.495"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>item</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>冬瓜</th>\n",
       "      <td>70</td>\n",
       "      <td>5.495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>白菜</th>\n",
       "      <td>60</td>\n",
       "      <td>3.495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>萝卜</th>\n",
       "      <td>40</td>\n",
       "      <td>2.495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>辣椒</th>\n",
       "      <td>60</td>\n",
       "      <td>4.495</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      weight  price\n",
       "item               \n",
       "冬瓜        70  5.495\n",
       "白菜        60  3.495\n",
       "萝卜        40  2.495\n",
       "辣椒        60  4.495"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 使用merge合并总重量及平均价格\n",
    "display(df1, df2)\n",
    "\n",
    "df1.merge(df2, left_index=True, right_index=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
